ABSTRACT


Sequences of the large subunit (LSU) of ribosomal DNA from Arabidopsis thaliana, Brassica napus, Sinapsis 
alba, Oryza sativa, Fragaria x ananassa, Lycopersicon esculentum, and Citrus limon were analyzed for nucleotide 
composition, presence of “cryptic” sequence simplicity, and evidence of molecular coevolution among the 12 expansion 
segments. The median value for GC content across the seven plants for the LSU was 56%, but the distribution of 
GC was nonrandom. Expansion segments were decidedly more GC rich (65% on average) than were the conserved
core regions (52% on average). Only Oryza sativa had significant cryptic sequence simplicity, which was found to 
be greatest in expansion segments D8 and D12. Sequence similarity between expansion segments also was strongest 
in rice as determined by visual inspection of dot plots. The complex nature of sequence variation in the LSU of rDNA
complicates the use of this molecule as a molecular systematic marker.


Choosing an appropriate gene(s) is one of the 
most important steps for any molecular systematics 
study. The nature of sequence variation in a given 
gene influences all subsequent steps in the analysis, 
including the quality of the multiple sequence align- 
ment (which serves as the basis for hypotheses of 
positional homology for each nucleotide or amino 
acid), and the outcome of the algorithm used for 
phylogenetic reconstruction. Friedlander et al. 
(1992) described several criteria for selecting "ideal"
molecular systematic markers, including (1)
genes that are present in single copies (or if present 
in multiple copies can be differentiated one from 
another or are homogeneous within a species), (2) 
genes or regions that are longer than 500 base 
pairs, (3) genes that contain both conservative and 
variable regions, (4) genes that lack significant 
nucleotide composition bias, and (5) genes that lack 
many or long introns. In addition to these struc- 
tural/compositional factors, it also is desirable to 
know something of the modes of sequence variation 
patterns across a molecule to assess the potential
for non-independence of characters and other violations
of the assumptions which are made by
different algorithms used for phylogenetic recon- 
struction. 

The molecular biology, evolution, and general 
biosystematic utility of rDNA have been reviewed 
extensively and will not be addressed in detail here 
(Arnheim, 1983; Dover & Flavell, 1984; Gerbi, 
1985; Appels & Honeycutt, 1986; Flavell, 1986; 
Schaal & Learn, 1988; Jorgensen & Cluster, 1988; 
Knaack et al., 1990; Larson, 1991; Hillis & Dixon, 
1991). For systematic analyses, the small subunit 
of rDNA has been used most often, especially for 
deep evolutionary divergences (e.g., Zimmer et al., 
1989; Nickrent & Soltis, 1995, this issue). Re- 
cently, the internal transcribed spacer (ITS) regions 
that flank the 5.8S coding region in the rDNA repeat
have gained attention as suitable markers for inter- 
and intra-generic comparisons in plants (e.g., Baldwin,
1992; Baldwin et al., 1995, this issue). The
LSU has been used to a limited extent in plant 
systematics and all of the published nucleotide sequence-based
analyses to date have used partial
LSU sequence data (Hamby & Zimmer, 1988;
Kantz et al., 1990; Zechman et al., 1990; Bult
& Zimmer, 1993).

TABLE 1. Complete nuclear LSU rDNA sequences for plants in GenBank.

Taxon                     Common name        GenBank accession   Reference
Arabidopsis thaliana      mouse-ear cress    X52320              Unfried & Gruendler, 1990
Brassica napus            rapeseed           D10840              Okumura & Shimada, 1992
Sinapsis alba             mustard            X57137              unpublished
Citrus limon              citrus             X05910              Kolosha & Fodor, 1990
Fragaria x ananassa       strawberry         X58118              unpublished
Lycopersicon esculentum   tomato             X13557              Kiss et al., 1989
Oryza sativa              rice               M11585              Sugiura et al., 1985

The rDNA LSU is a mosaic consisting of cores 
of highly conserved regions interspersed with 12 
regions of variable size called "expansion segments"
(Clark et al., 1984) or "divergent domains"
(Hassouna et al., 1984). The number and relative 
positions of the expansion segments within the LSU 
are highly conserved among a wide range of taxa. 
Most of the fluctuations in length associated with 
expansion segments reflect the gain and loss of 
short, directly repetitive sequence motifs by a slip- 
page-like mechanism during turnover (Hancock & 
Dover, 1988). Tautz et al. (1986, 1988) coined 
the term "cryptic sequence simplicity" to describe
the sequence footprints of the motifs after they are 
shuffled among themselves by repeated slippage 
events. Unlike tandem arrays of unshuffled direct 
repeats (pure simplicity), cryptic simplicity is not 
easy to detect by simple visual inspection of a 
sequence. The elevated rates of sequence variation 
observed in expansion segments suggest that these 
regions lack function and, therefore, tolerate rel- 
atively high rates of point mutations and structural 
changes (Gerbi, 1985; Hancock & Dover, 1988; 
Tautz et al., 1988). However, the following three 
lines of evidence suggest that expansion segments 
do play a functional role, or at least evolve under 
selective constraints: (i) transcripts of the expansion 
segments are present in the mature 28S rRNA of 
some eukaryotes (Hassouna et al., 1984), (ii) dot 
plot comparisons of the complete LSU sequence 
from Drosophila demonstrate that expansion seg- 
ments coevolve (Hancock & Dover, 1988), and 
(iii) expansion segments show a high degree of 
secondary structure conservation (Hancock et al., 
1988). One hypothesis for the observed patterns 
of sequence change among and within the expan- 
sion segments is that they play some role in main- 
taining the steric integrity of the ribosome (Gerbi, 
1985). 


The complex mode of sequence variation ob- 
served in the LSU potentially complicates the use 
of this molecule as a phylogenetic marker. Motif 
shuffling and molecular coevolution among expan- 
sion segments will violate assumptions of character 
independence and compromise hypotheses of ho- 
mology at many nucleotide positions in a multiple- 
sequence alignment. The purpose of the present 
study is to report on the nature of sequence vari- 
ation across the large subunit (LSU) of ribosomal 
DNA (rDNA) in plants and discuss the implications 
of the observed patterns of variation for using the 
LSU as a molecular marker in phylogenetic analyses.
We compare the conserved core and expansion
segment regions for full-length sequences of
the large subunit of rDNA in plants to (1) assess 
nucleotide composition, (2) test for cryptic se- 
quence simplicity, and (3) look for evidence of 
molecular coevolution among expansion segments. 


MATERIALS AND METHODS 
FULL-LENGTH LSU SEQUENCES FROM GENBANK 


Seven full-length rDNA LSU sequences were 
found in version 79 of GenBank. The names and 
database accession numbers of these sequences are 
given in Table 1. Many more partial LSU sequence 
fragments are available in GenBank, but were not 
included in this analysis. 


SEQUENCE SIMPLICITY 


Sequence simplicity was analyzed using the 
SIMPLE34 program (Tautz et al., 1986; Hancock 
& Armstrong, 1994) running on a Sun SPARC- 
station 2 computer. The SIMPLE34 program is 
available via anonymous ftp or gopher from 
life.anu.edu.au or by sending an electronic mail 
message to John.Hancock@anu.edu.au. Sequence
simplicity is evaluated by counting the number of 
tri- and tetranucleotides along the length of a DNA 
sequence using a 32-bp sliding window. The min- 
imum length of a sequence that can be analyzed 
by this method is 68 bp (Hancock & Armstrong,
1994). The statistical significance of the trimer 
and tetramer count is determined by comparing 
the results for the test sequence to values obtained 
for a set of ten randomly shuffled sequences (10,000 
nucleotides long) having the same base composition 
as the test sequence. The score comparing the test 
sequence to the random sequence is called a Rel- 
ative Simplicity Factor (RSF). An RSF value great- 
er than 1 for a test sequence is statistically sig- 
nificant if the raw simplicity factor score for the 
test sequence is greater than 3 standard deviations 
of the simplicity factor value for the ten randomized 
sequences. Sequences with RSF values significantly greater than 1
contain cryptically simple direct repeats; those with RSF values
significantly less than 1 contain cryptically simple
inverted repeats.
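
The scoring procedure described above can be illustrated with a minimal Python sketch. It is not the SIMPLE34 program: the scoring rule (counting recurrences of each trimer and tetramer within 32 bp on either side, normalized per base), the use of sampling with replacement to build random sequences of matching expected base composition, the reading of the significance rule as a deviation of more than 3 standard deviations from the mean of the random scores, and the names `simplicity_score` and `relative_simplicity` are all assumptions made for illustration.

```python
import random
from statistics import mean, stdev


def simplicity_score(seq, window=32):
    """Per-base count of trimer and tetramer recurrences near each position.

    For every position i, the trimer and tetramer starting at i are compared
    with all motifs starting within `window` bp on either side (assumed rule).
    """
    seq = seq.upper()
    total = 0
    for i in range(len(seq) - 3):
        lo = max(0, i - window)
        hi = min(len(seq), i + window + 4)
        for k in (3, 4):
            motif = seq[i:i + k]
            for j in range(lo, hi - k + 1):
                if j != i and seq[j:j + k] == motif:
                    total += 1
    return total / max(1, len(seq))


def relative_simplicity(seq, n_random=10, random_len=10_000, seed=0):
    """Return (RSF, significant) for `seq` against random sequences."""
    rng = random.Random(seed)
    seq = seq.upper()
    test = simplicity_score(seq)
    random_scores = []
    for _ in range(n_random):
        # Draw bases with replacement so the expected composition matches seq.
        randomized = "".join(rng.choices(seq, k=random_len))
        random_scores.append(simplicity_score(randomized))
    mu, sd = mean(random_scores), stdev(random_scores)
    rsf = test / mu
    # Assumed reading of the 3-standard-deviation criterion.
    significant = abs(test - mu) > 3 * sd
    return rsf, significant


# Toy input: a pure trinucleotide repeat (shorter random sequences for speed).
rsf, significant = relative_simplicity("CAG" * 60, random_len=1000)
print(round(rsf, 2), significant)
```

A pure tandem repeat such as the toy input is the extreme case of simplicity; shuffled, cryptic motifs would give smaller but still elevated scores under the same rule.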

An underlying bias in dinucleotide composition 
for a particular sequence could, in turn, bias the 
statistical assessment of trimer and tetramer motifs 
in the RSF analysis. To correct for underlying 
dinucleotide composition bias, sequences that gave 
significant RSF scores were re-analyzed using a 
2nd order Markov rule (Hancock & Armstrong, 1994).


DOT PLOT ANALYSIS 


Dot plot comparisons were performed using the 
ABI Inherit Analysis software (v. 1.1; Applied Biosystems,
Inc.) on a Macintosh Quadra 900 computer
to examine levels of sequence similarity between
expansion segments (internal sequence
similarity). The LSU of Escherichia coli (Brosius 
et al., 1980; GenBank Accession V00331) was 
included as a negative control as it contains no 
expansion segments (Clark et al., 1984; Hancock 
& Dover, 1988); the LSU of Homo sapiens (Gon- 
zalez et al., 1985; GenBank Accession M11107) 
was included for comparison because it shows strong 
patterns of expansion segment similarities (Han- 
cock & Dover, 1988). Internal sequence similarity 
was scored as present or absent by visual inspection 
of the dot plots. Dot plots were generated using a 
sliding window of 9 bp (with an offset of 3 bp) and 
an error tolerance of 33%. These parameters gave
the best signal-to-noise ratio for detecting evidence 
of internal sequence similarity in plants. Because 
the human sequence has stronger patterns of in- 
ternal sequence similarity than that observed for 
plants, the window was expanded to 10 bp (offset 
of 3; error tolerance of 20%) to decrease the back- 
ground. Sequence simplicity profiles generated by 
the SIMPLE34 program were plotted along the X
and Y axes of each dot plot to aid in visual interpretation.
The profiles are graphical representations
of levels of sequence simplicity over the length
of the test sequence, averaged over blocks of 10 
nucleotides. 
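
The windowed comparison underlying a dot plot can be sketched as follows. This is a simplified illustration and not the ABI Inherit implementation; the function name `dot_plot`, the treatment of the error tolerance as the maximum percentage of mismatched positions per window pair, and the toy sequence are assumptions. Under that reading, a 9-bp window at 33% and a 10-bp window at 20% both permit at most two mismatches.

```python
def dot_plot(seq_a, seq_b, window=9, offset=3, tolerance_pct=33):
    """Return the set of (i, j) window starts that count as a dot.

    Windows of `window` bp are placed every `offset` bp along both sequences.
    A dot is recorded when the percentage of mismatched positions in a window
    pair does not exceed `tolerance_pct` (assumed interpretation).
    """
    a, b = seq_a.upper(), seq_b.upper()
    dots = set()
    for i in range(0, len(a) - window + 1, offset):
        window_a = a[i:i + window]
        for j in range(0, len(b) - window + 1, offset):
            window_b = b[j:j + window]
            mismatches = sum(x != y for x, y in zip(window_a, window_b))
            # Integer arithmetic avoids floating-point edge cases.
            if mismatches * 100 <= tolerance_pct * window:
                dots.add((i, j))
    return dots


# Toy sequence: the same GC-rich block at positions 0 and 21, separated by
# an AT-rich spacer, compared against itself (internal sequence similarity).
block = "GGCGGCGCC"
toy = block + "ATTATATAATAT" + block
dots = dot_plot(toy, toy)
off_diagonal = sorted((i, j) for i, j in dots if i != j)
print(off_diagonal)
```

In a self-comparison the main diagonal is always filled, so evidence of internal similarity lies in the off-diagonal dots.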


TERMINOLOGY 


We use the term expansion segment instead of 
divergent domain in this report because of the 
multiple meanings of the word domain in molecular 
biology. However, we retain the standard naming 
scheme of D1 through D12 to refer to individual
expansion segments. 


RESULTS AND DISCUSSION 
NUCLEOTIDE COMPOSITION 


The average GC content of plant LSU sequences
is 56% (Table 2). GC is not,
however, randomly distributed throughout the
length of the sequence. For all seven species examined,
expansion segments have a higher average
GC content (65%) than do the conserved core
regions (52%).
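
A comparison of this kind can be sketched in Python as shown below. The function names `gc_percent` and `partition_gc`, the toy sequence, and the segment coordinates are hypothetical and serve only to illustrate how GC content might be tallied for the full sequence, the core regions, and the expansion segments.

```python
def gc_percent(seq):
    """Percentage of G and C among the unambiguous bases A, C, G, T, and U."""
    seq = seq.upper()
    gc = seq.count("G") + seq.count("C")
    total = gc + seq.count("A") + seq.count("T") + seq.count("U")
    return 100.0 * gc / total if total else 0.0


def partition_gc(seq, segments):
    """Return GC percentages for (ALL, CORE, ES).

    `segments` lists expansion segment coordinates as (start, end) pairs,
    0-based and end-exclusive, assumed sorted and non-overlapping.
    Everything outside the segments is treated as conserved core.
    """
    inside = "".join(seq[start:end] for start, end in segments)
    outside, previous_end = [], 0
    for start, end in segments:
        outside.append(seq[previous_end:start])
        previous_end = end
    outside.append(seq[previous_end:])
    return gc_percent(seq), gc_percent("".join(outside)), gc_percent(inside)


# Toy example with one GC-rich segment at positions 8-15.
toy = "ATATATAT" + "GCGCGGCC" + "ATTAGCAT"
all_gc, core_gc, es_gc = partition_gc(toy, [(8, 16)])
print(round(all_gc, 1), core_gc, es_gc)
```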


SEQUENCE SIMPLICITY 


Of the seven plant species analyzed, only rice 
(Oryza sativa) has statistically significant sequence 
simplicity (Table 2). The significance is retained 
even after a 2nd order Markov correction for un- 
derlying dinucleotide composition bias (data not 
shown). Although their overall RSF values are not 
significant, three of the taxa examined, Arabidop- 
sis, Sinapsis, and Fragaria, show over-represen- 
tation of a particular tetramer motif (CGGC, CGGA, 
and AAAC, respectively; data not shown). Analysis 
of the 12 expansion segments individually (Table 
3) reveals that much of the overall simplicity in 
plants is concentrated in expansion segments D8 
and D12. The RSF values for D8 are significant 
for Arabidopsis, Citrus, Fragaria, Lycopersicon, 
and Oryza; the RSF values for D12 are significant 
for Arabidopsis, Citrus, and Oryza. After cor- 
recting for underlying dinucleotide composition bias, 
only the RSF values for Oryza remain statistically 
significant. In some vertebrates (human, mouse, 
rat), RSF values are greatest in expansion segments 
D2, D6, D8, and D12 (Hancock & Dover, 1988).

We also applied sequence simplicity analysis to 
published full-length sequences of the LSU from 
plant chloroplast and mitochondrial genomes and 
to nuclear small subunit (16S/18S, SSU) rDNA.
The RSF values for these sequences are not sta- 
tistically significant (data not shown). 
TABLE 2. Nucleotide composition, relative simplicity factor (RSF) values, and internal sequence similarity for full-length sequences of LSU rDNA in plants. Homo sapiens and Escherichia coli are included for comparison (see text).

                                        % GC*                  Internal
                          Length   ----------------            sequence
Taxon                     (bp)     ALL   CORE   ES    RSF      similarity‡
Arabidopsis thaliana      3375     55    52     63    1.031    +/-
Brassica napus            3378     54    51     61    0.984    -
Sinapsis alba             3381     55    52     61    1.015    -
Citrus limon              3393     57    53     68    1.080    ±
Fragaria x ananassa       3377     56    52     66    1.039    +/-
Lycopersicon esculentum   3381     57    52     66    1.063    -
Oryza sativa              3377     59    53     71    1.184†   +
Homo sapiens              5025     69    nd     nd    1.681†   ++
Escherichia coli          2904     53    nd     nd    1.015    -

* ALL = full-length sequence; CORE = conserved core regions; ES = expansion segments; nd = not determined.
† Denotes statistically significant RSF.
‡ Absence (-) or presence (+) of internal sequence similarity is determined by visual inspection of dot plots (Fig. 2).


The positions of the 12 expansion segments (D1-D12)
relative to the sequence simplicity profile
peaks for the LSU rDNA of rice are shown in 
Figure 1. The expansion segment coordinates of 
rice (unpublished data provided by John Hancock 
and Gabriel Dover) were mapped to a multiple- 
sequence alignment of the LSU rDNA from all 
seven plant species to determine position and length 
variation in the expansion segments and the con- 
served core regions. Relative positions of expansion 
segments and the core regions are conserved across 
all of the plant species analyzed. Some length variation
was observed between species for the expansion
segments and the conserved core regions. Most
of the length variation in the conserved core regions
was due to single position insertion/deletion events.
The largest insertion/deletion event observed in
any conserved core was a single gap involving 4
contiguous positions in the alignment. In contrast,
most of the length variation in the expansion segments
was due to insertion/deletion events involving
at least 2 contiguous positions in the multiple-sequence
alignment. The largest insertion/deletion
event was observed in expansion segment D2 and
involved 10 contiguous nucleotide positions in the
alignment.

TABLE 3. Relative simplicity factor values for expansion segments (longer than 68 bp) of plant LSU of rDNA. Nucleotide composition (in %GC) given in parentheses.

                         Expansion segment [median size in bp]
Taxon                    D1      D2      D3      D4      D5      D6      D7a     D7b     D8      D9      D10     D11     D12
                         [150]   [224]   [107]   [8]     [39]    [27]    [431]   [27]    [141]   [24]    [75]    [4]     [130]
Arabidopsis thaliana     0.716   0.958   0.803   -       -       -       -       -       1.464*  -       0.404   -       1.270*
                         (62)    (68)    (59)                                            (73)            (66)            (61)
Brassica napus           0.663   0.981   0.607   -       -       -       -       -       0.800   -       0.845   -       0.963
                         (62)    (66)    (61)                                            (66)            (62)            (61)
Sinapsis alba            0.730   0.992   0.705   -       -       -       -       -       1.127   -       0.845   -       0.855
                         (62)    (68)    (56)                                            (68)            (62)            (61)
Citrus limon             1.028   1.229   0.887   -       -       -       -       -       1.105*  -       0.055   -       1.280*
                         (70)    (75)    (61)                                            (76)            (69)            (72)
Fragaria x ananassa      0.685   1.051   0.859   -       -       -       -       -       1.068*  -       0.130   -       1.178
                         (66)    (72)    (56)                                            (72)            (64)            (66)
Lycopersicon esculentum  0.912   0.964   1.010   -       -       -       -       -       1.049*  -       0.150   -       1.081
                         (68)    (73)    (58)                                            (75)            (64)            (67)
Oryza sativa             0.587   1.032   1.017   -       -       -       -       -       1.473†  -       0.443   -       1.685†
                         (71)    (79)    (64)                                            (82)            (68)            (80)

* RSF is not statistically significant after Markov correction for dinucleotide composition.
† Statistically significant RSF.

FIGURE 1. Locations of divergent domains (D1-D12) in the large subunit rDNA of Oryza sativa mapped to regions of sequence simplicity measured by SIMPLE34 (Hancock & Armstrong, 1994). Length of sequence given in base pairs.

Overall sequence variation was much greater in 
the expansion segments than in the conserved core 
regions. Of 1034 nucleotide positions comprising 
the expansion segments, 431 (42%) were variable. 
Seventy of the 431 variable positions were due to 
insertion/deletion events. Of the 2423 nucleotides
in the conserved core regions, 239 (10%) were
variable. Forty-eight of the 239 variable positions
in the conserved core regions were due to insertion/deletion
events.
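
Tallies of this kind can be obtained by classifying the columns of a multiple-sequence alignment. The sketch below is illustrative only; the function name `classify_columns`, the gap symbol, and the toy alignment are assumptions, and a column is counted as an insertion/deletion position whenever it is variable and contains at least one gap.

```python
def classify_columns(alignment, gap="-"):
    """Return (length, variable, variable_with_gap) for an alignment.

    `alignment` is a list of equal-length aligned sequences. A column is
    variable when it holds more than one distinct state, gaps included.
    """
    length = len(alignment[0])
    if any(len(row) != length for row in alignment):
        raise ValueError("all aligned sequences must have the same length")
    variable = variable_with_gap = 0
    for column in zip(*alignment):
        states = {symbol.upper() for symbol in column}
        if len(states) > 1:
            variable += 1
            if gap in states:
                variable_with_gap += 1
    return length, variable, variable_with_gap


# Toy alignment: columns 2 and 8 differ by substitution, column 4 by a gap.
toy_alignment = ["ACGT-ACGT", "ACGTTACGA", "ACCT-ACGT"]
length, variable, variable_with_gap = classify_columns(toy_alignment)
print(length, variable, variable_with_gap)
print(round(100 * variable / length))
```

Applied to the counts reported here, 431 of 1034 positions corresponds to about 42% and 239 of 2423 to about 10%.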


DOT PLOT ANALYSIS 


A visual comparison of the dot plots of the LSU 
in plants to those for humans and bacteria reveals 
that the levels of internal sequence similarity are 
generally low in plants (Fig. 2). Oryza shows the 
strongest patterns of inter-expansion segment sim- 
ilarity, which appear as intensely dark regions on 
the dot plot. These regions of high similarity cor- 
respond well to the peaks of sequence simplicity 
associated with the expansion segments in the se- 
quence simplicity profiles that border the dot plot. 
Interestingly, the dot plot of Citrus limon also 
shows a relatively strong pattern of expansion seg- 
ment similarity. Because the RSF values for Citrus 
are not statistically significant, the similarity is most 
likely due to similar nucleotide composition in the 
expansion segment regions. This result highlights 
the importance of interpreting dot plot analyses in 
conjunction with sequence simplicity analyses. 


FIGURE 2A-I (pp. 240-244).


SUMMARY 


The LSU of rDNA is often touted, anecdotally 
and in the literature, as an alternative gene to the 
small subunit (16S/18S or SSU) of rDNA for phylogenetic
analysis because it is longer and has a
higher overall rate of sequence variability and, 
therefore, may provide more phylogenetically in- 
formative characters than the SSU for certain sys- 
tematic questions (Larson, 1991). As demonstrated 
in this paper, and in the others cited herein, the 
patterns of nucleotide sequence change in the LSU 
are more complicated than is indicated by this 
"face value" assessment. Although the expansion 
segment regions of the LSU are potentially prob- 
lematic as molecular markers in phylogenetic anal- 
yses, the conserved core regions may be useful as 
a source of molecular characters for exploring deep 
evolutionary divergences in the plant kingdom. 

FIGURE 2. Intra-specific dot plots of the complete sequences of LSU rDNA for seven plant species. Sequence simplicity profiles generated by SIMPLE34 border each dot plot. The higher the peak, the greater the level of sequence simplicity. Peaks of sequence simplicity generally correspond to expansion segments in eukaryotes. Dot plots for Homo sapiens and Escherichia coli are shown for comparison (see text). Panel labels: C. Oryza sativa; G. Citrus limon; I. Lycopersicon esculentum.

The analysis of the LSU for plants reveals the
following: (1) nonrandom distribution and bias in
nucleotide composition in expansion segments of
all plant taxa examined, (2) significant cryptic sequence
simplicity for Oryza and borderline simplicity
for Arabidopsis, Fragaria, and Sinapsis,
and (3) similarity among expansion segments (in
Oryza) beyond that due to shared nucleotide composition.
Nucleotide bias will complicate estimates
of divergence times and will result in high levels
of homoplasy due to convergence. For example,
interspecific dot plot analyses by Hancock & Dover
(1988, 1990) revealed a high degree of sequence
similarity between human and rice due to nucleotide
composition bias. Slippage-like mechanisms
that lead to shuffling of small direct repeats in the 
expansion segments during turnover will obliterate 
evidence of ancestry in some cases, making ho- 
mology assessment simply guesswork. This problem 
will be especially acute for deep divergences. Mo- 
lecular coevolution among expansion segments will 
result in non-independence of characters. In their 
analysis of the secondary structure of the LSU in 
Drosophila, Hancock et al. (1988) described ev- 
idence for compensatory change among positions 
in the expansion segments. Weighting schemes have 
been proposed for compensatory mutations in the 
base-paired stem regions of rDNA based on sec- 
ondary structure predictions (Wheeler & Honey- 
cutt, 1988; Dixon & Hillis, 1993). Similar weight- 
ing schemes may be useful for expansion segments 
as well. The extent to which there is compensatory 
mutation in the expansion segments in plants is not 
known and awaits detailed modeling of secondary 
structure. In his molecular systematic work with 
the LSU of rDNA in salamanders, Larson (1991) 
explored the rates of sequence change in the expansion
segments and concluded that violations of
the assumptions of parsimony were not severe enough
to affect adversely the outcome of the analysis. He
also proposed that a rate-invariant parsimony based
on compositional statistics (Sidow & Wilson, 1990) 
be employed for phylogenetic analyses based on 
the LSU to compensate for the nucleotide com- 
position bias observed in this molecule, especially 
for deep divergences. These post-alignment ap- 
proaches, however, do not address the difficulties 
presented by the LSU for the basic assumptions of 
homology and independence among characters. 

Because the number of plant species for which 
complete LSU sequence data currently are avail- 
able is small and not representative of the overall 
phylogenetic diversity of the plant kingdom, we 
are sequencing the complete LSU of rDNA from 
representatives of all the major plant lineages. These 
data will enable us to determine if the patterns of 
sequence change described in this paper are found 
throughout the plant kingdom or are lineage-spe- 
cific, and to test empirically the phylogenetic utility 
of both the expansion segments and the conserved 
core regions in the LSU of rDNA in plants. 